Add blob direct write with partitioned blob files #14457
Draft
xingbowang wants to merge 15 commits into facebook:main
Summary
Add a new blob direct write feature with partitioned blob files that writes blob values directly to blob files during `Put()`, bypassing both WAL and memtable for large values. Only the small (~30 byte) `BlobIndex` pointer is stored in WAL and memtable. This reduces WAL write amplification, memtable memory usage, and blob write lock contention for large-value workloads.
Motivation
With standard blob separation, full blob values are first written to WAL, then stored in the memtable, and only separated into blob files during flush. For workloads with large values (e.g., 4KB–1MB), this means the WAL and memtable carry the full value payload even though it will eventually be stored separately. This wastes WAL bandwidth, inflates memtable memory, and adds unnecessary write amplification.
Additionally, the existing blob file write path uses a single blob file writer per column family, which becomes a serialization bottleneck under concurrent write workloads. Partitioned blob files address this by spreading writes across multiple independent blob files, each with its own lock, enabling true parallel blob I/O from multiple writer threads.
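The partitioning idea can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `Partition`, `BlobLocation`, and `PartitionedBlobWriter` are hypothetical names, an in-memory string stands in for each blob file, and round-robin assignment is assumed. The point it shows is that an append takes only the lock of the partition it lands in, so writers hitting different partitions do not serialize on each other.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for one open blob file. An in-memory buffer
// replaces real file I/O so the sketch stays self-contained.
struct Partition {
  std::mutex mu;
  std::string data;
};

// Where a value ended up: which partition, and the byte range inside it.
struct BlobLocation {
  std::size_t partition;
  std::uint64_t offset;
  std::uint64_t size;
};

class PartitionedBlobWriter {
 public:
  explicit PartitionedBlobWriter(std::size_t num_partitions) {
    assert(num_partitions > 0);
    for (std::size_t i = 0; i < num_partitions; ++i) {
      partitions_.push_back(std::make_unique<Partition>());
    }
  }

  // Chooses a partition round-robin (an assumption of this sketch), then
  // appends while holding only that partition's mutex.
  BlobLocation Append(const std::string& value) {
    const std::size_t idx =
        next_.fetch_add(1, std::memory_order_relaxed) % partitions_.size();
    Partition& p = *partitions_[idx];
    std::lock_guard<std::mutex> guard(p.mu);
    const std::uint64_t offset = p.data.size();
    p.data.append(value);
    return BlobLocation{idx, offset, value.size()};
  }

  // Reads a previously appended value back from its partition.
  std::string Read(const BlobLocation& loc) {
    Partition& p = *partitions_[loc.partition];
    std::lock_guard<std::mutex> guard(p.mu);
    return p.data.substr(loc.offset, loc.size);
  }

 private:
  std::vector<std::unique_ptr<Partition>> partitions_;
  std::atomic<std::size_t> next_{0};
};
```

With two partitions, three consecutive `Append` calls from one thread land in partitions 0, 1, and 0. The returned `BlobLocation` plays the role that the small pointer plays in the description above.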
Design
Write Path
- `DBImpl::Put()` fast path: For single-key puts where the value exceeds `min_blob_size`, the blob is written directly to a blob file and a `BlobIndex`-only `WriteBatch` is constructed, avoiding full value serialization entirely.
- `DBImpl::WriteImpl()` batch path: For multi-key `WriteBatch` operations, a `BlobWriteBatchTransformer` iterates the batch, writes qualifying values to blob files, and replaces them with `BlobIndex` entries before the batch enters WAL/memtable.
BlobFilePartitionManager
A new `BlobFilePartitionManager` manages partitioned blob files for concurrent writes:
- Multiple partitions (`blob_direct_write_partitions`), each with its own mutex, reducing lock contention for concurrent writers.
- Deferred flush mode (`blob_direct_write_buffer_size > 0`): Zero-copy buffering where `Slice` references point directly into the `WriteBatch` buffer. Background threads flush to disk in batches, amortizing syscall overhead. Includes backpressure with stall watermarks.
- Sync mode (`blob_direct_write_buffer_size = 0`): Immediate write-through for maximum durability.
- `BlobFilePartitionStrategy` interface for key/value-aware partition assignment (default: round-robin).
Flush Integration
`BlobFilePartitionManager::SealAllPartitions()` finalizes open blob files and injects `BlobFileAddition` entries into the flush `VersionEdit`, so blob files are registered in the MANIFEST atomically with the flush SST.
Crash Recovery
- Orphan recovery in `DBImpl::Open()`: Scans for blob files not registered in the MANIFEST (e.g., from crashes before flush), reads their headers to determine column family, validates records, and registers them via `VersionEdit`. Runs regardless of the current `enable_blob_direct_write` setting to handle DBs previously opened with the feature.
- WAL recovery: `BlobIndex` entries pointing to these recovered blob files remain resolvable, ensuring no data loss.
Read Path
- `DBIter` and `ArenaWrappedDBIter` extended to resolve `BlobIndex` entries from direct-write blob files.
- `BlobFileCache` → blob file read.
New Options
- `enable_blob_direct_write` (bool, default: false): master switch
- `blob_direct_write_partitions` (uint32, default: 1): number of concurrent blob file partitions
- `blob_direct_write_buffer_size` (uint64, default: 4MB): per-partition write buffer; 0 = sync mode
- `blob_direct_write_use_direct_io` (bool, default: false): O_DIRECT for blob writes
- `blob_direct_write_flush_interval_ms` (uint64, default: 0): periodic background flush interval
- `blob_direct_write_partition_strategy` (shared_ptr, default: round-robin)
New Statistics
- `BLOB_DB_DIRECT_WRITE_COUNT`: number of blobs written via direct write
- `BLOB_DB_DIRECT_WRITE_BYTES`: bytes written via direct write
- `BLOB_DB_DIRECT_WRITE_STALL_COUNT`: writer stalls due to backpressure
- `BLOB_DB_COMPRESSION_MICROS`: blob compression timing
Testing
- `db_blob_direct_write_test.cc` covering: basic put/get, multi-get, concurrent writers, compression (with Snappy availability checks), crash recovery, orphan recovery, WAL recovery, snapshot isolation, transactions (including 2PC), backpressure, multiple column families, file rotation, statistics, event listeners, file checksums, direct I/O, sync/deferred flush modes, and error injection.
- `db_stress` and `db_crashtest.py` integration for continuous randomized testing.
- `make check` passes (39,454 tests, 0 failures).
New Files
- `db/blob/blob_file_partition_manager.cc/.h`: core partition manager (~1,700 lines)
- `db/blob/blob_write_batch_transformer.cc/.h`: WriteBatch transformation logic
- `db/blob/db_blob_direct_write_test.cc`: comprehensive test suite (~2,000 lines)
- `db/blob/blob_file_completion_callback.cc`: SstFileManager and EventListener integration
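For readers skimming the file list, the WriteBatch transformation described under Write Path can be sketched in isolation. This is a simplified model under stated assumptions, not the `BlobWriteBatchTransformer` from this PR: `BlobRef`, `Entry`, `BlobFile`, `TransformBatch`, and `Resolve` are hypothetical names, the batch is a plain vector rather than a serialized `WriteBatch`, and the strict `>` comparison follows the word "exceeds" in the description (whether the real threshold is inclusive is not stated there).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified pointer to a value stored in a blob file.
struct BlobRef {
  std::uint64_t file_number;
  std::uint64_t offset;
  std::uint64_t size;
};

// One entry of a simplified write batch: an inline value or a BlobRef.
struct Entry {
  std::string key;
  std::string value;  // meaningful only while is_blob_ref is false
  bool is_blob_ref = false;
  BlobRef ref{};
};

// Stand-in for a blob file; a string buffer replaces real file I/O.
struct BlobFile {
  std::uint64_t file_number;
  std::string data;
};

// Moves every value larger than min_blob_size into the blob file and leaves
// a small reference in the batch. Returns the number of values moved.
std::size_t TransformBatch(std::vector<Entry>& batch, BlobFile& file,
                           std::size_t min_blob_size) {
  std::size_t moved = 0;
  for (Entry& e : batch) {
    if (e.is_blob_ref || e.value.size() <= min_blob_size) continue;
    e.ref = BlobRef{file.file_number, file.data.size(), e.value.size()};
    file.data.append(e.value);
    e.value.clear();  // the batch now carries only the reference
    e.is_blob_ref = true;
    ++moved;
  }
  return moved;
}

// Returns the value for an entry, following the reference when present.
std::string Resolve(const Entry& e, const BlobFile& file) {
  if (!e.is_blob_ref) return e.value;
  assert(e.ref.file_number == file.file_number);
  return file.data.substr(e.ref.offset, e.ref.size);
}
```

After `TransformBatch` runs, whatever consumes the batch next (WAL and memtable in the description above) sees full values only for small entries. Large entries carry a fixed-size reference, and `Resolve` mirrors the read side in miniature.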